{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression - Auto Imports\n",
    "\n",
    "This sample notebook is based on the Gallery [Sample 6: Train, Test, Evaluate\n",
    "for Regression: Auto Imports\n",
    "Dataset](https://gallery.cortanaintelligence.com/Experiment/670fbfc40c4f44438bfe72e47432ae7a)\n",
    "for AzureML Studio.  This experiment demonstrates how to build a regression\n",
    "model to predict the automobile's price.  The process includes training, testing,\n",
    "and evaluating the model on the Automobile Imports data set.\n",
    "\n",
    "This sample demonstrates the use of several members of the synapseml library:\n",
    "- [`TrainRegressor`\n",
    "  ](https://mmlspark.blob.core.windows.net/docs/1.0.10/pyspark/synapse.ml.train.html?#module-synapse.ml.train.TrainRegressor)\n",
    "- [`SummarizeData`\n",
    "  ](https://mmlspark.blob.core.windows.net/docs/1.0.10/pyspark/synapse.ml.stages.html?#module-synapse.ml.stages.SummarizeData)\n",
    "- [`CleanMissingData`\n",
    "  ](https://mmlspark.blob.core.windows.net/docs/1.0.10/pyspark/synapse.ml.featurize.html?#module-synapse.ml.featurize.CleanMissingData)\n",
    "- [`ComputeModelStatistics`\n",
    "  ](https://mmlspark.blob.core.windows.net/docs/1.0.10/pyspark/synapse.ml.train.html?#module-synapse.ml.train.ComputeModelStatistics)\n",
    "- [`FindBestModel`\n",
    "  ](https://mmlspark.blob.core.windows.net/docs/1.0.10/pyspark/synapse.ml.automl.html?#module-synapse.ml.automl.FindBestModel)\n",
    "\n",
    "First, import the pandas package so that we can read and parse the datafile\n",
    "using `pandas.read_csv()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = spark.read.parquet(\n",
    "    \"wasbs://publicwasb@mmlspark.blob.core.windows.net/AutomobilePriceRaw.parquet\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To learn more about the data that was just read into the DataFrame,\n",
    "summarize the data using `SummarizeData` and print the summary.  For each\n",
    "column of the DataFrame, `SummarizeData` will report the summary statistics\n",
    "in the following subcategories for each column:\n",
    "* Feature name\n",
    "* Counts\n",
    "  - Count\n",
    "  - Unique Value Count\n",
    "  - Missing Value Count\n",
    "* Quantiles\n",
    "  - Min\n",
    "  - 1st Quartile\n",
    "  - Median\n",
    "  - 3rd Quartile\n",
    "  - Max\n",
    "* Sample Statistics\n",
    "  - Sample Variance\n",
    "  - Sample Standard Deviation\n",
    "  - Sample Skewness\n",
    "  - Sample Kurtosis\n",
    "* Percentiles\n",
    "  - P0.5\n",
    "  - P1\n",
    "  - P5\n",
    "  - P95\n",
    "  - P99\n",
    "  - P99.5\n",
    "\n",
    "Note that several columns have missing values (`normalized-losses`, `bore`,\n",
    "`stroke`, `horsepower`, `peak-rpm`, `price`).  This summary can be very\n",
    "useful during the initial phases of data discovery and characterization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.stages import SummarizeData\n",
    "\n",
    "summary = SummarizeData().transform(data)\n",
    "summary.toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the dataset into train and test datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split the data into training and testing datasets\n",
    "train, test = data.randomSplit([0.6, 0.4], seed=123)\n",
    "train.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now use the `CleanMissingData` API to replace the missing values in the\n",
    "dataset with something more useful or meaningful.  Specify a list of columns\n",
    "to be cleaned, and specify the corresponding output column names, which are\n",
    "not required to be the same as the input column names. `CleanMissiongData`\n",
    "offers the options of \"Mean\", \"Median\", or \"Custom\" for the replacement\n",
    "value.  In the case of \"Custom\" value, the user also specifies the value to\n",
    "use via the \"customValue\" parameter.  In this example, we will replace\n",
    "missing values in numeric columns with the median value for the column.  We\n",
    "will define the model here, then use it as a Pipeline stage when we train our\n",
    "regression models and make our predictions in the following steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.featurize import CleanMissingData\n",
    "\n",
    "cols = [\"normalized-losses\", \"stroke\", \"bore\", \"horsepower\", \"peak-rpm\", \"price\"]\n",
    "cleanModel = (\n",
    "    CleanMissingData().setCleaningMode(\"Median\").setInputCols(cols).setOutputCols(cols)\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will create two Regressor models for comparison: Poisson Regression\n",
    "and Random Forest.  PySpark has several regressors implemented:\n",
    "* `LinearRegression`\n",
    "* `IsotonicRegression`\n",
    "* `DecisionTreeRegressor`\n",
    "* `RandomForestRegressor`\n",
    "* `GBTRegressor` (Gradient-Boosted Trees)\n",
    "* `AFTSurvivalRegression` (Accelerated Failure Time Model Survival)\n",
    "* `GeneralizedLinearRegression` -- fit a generalized model by giving symbolic\n",
    "  description of the linear predictor (link function) and a description of the\n",
    "  error distribution (family).  The following families are supported:\n",
    "  - `Gaussian`\n",
    "  - `Binomial`\n",
    "  - `Poisson`\n",
    "  - `Gamma`\n",
    "  - `Tweedie` -- power link function specified through `linkPower`\n",
    "Refer to the\n",
    "[Pyspark API Documentation](http://spark.apache.org/docs/latest/api/python/)\n",
    "for more details.\n",
    "\n",
    "`TrainRegressor` creates a model based on the regressor and other parameters\n",
    "that are supplied to it, then trains data on the model.\n",
    "\n",
    "In this next step, Create a Poisson Regression model using the\n",
    "`GeneralizedLinearRegressor` API from Spark and create a Pipeline using the\n",
    "`CleanMissingData` and `TrainRegressor` as pipeline stages to create and\n",
    "train the model.  Note that because `TrainRegressor` expects a `labelCol` to\n",
    "be set, there is no need to set `linkPredictionCol` when setting up the\n",
    "`GeneralizedLinearRegressor`.  Fitting the pipe on the training dataset will\n",
    "train the model.  Applying the `transform()` of the pipe to the test dataset\n",
    "creates the predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train Poisson Regression Model\n",
    "from pyspark.ml.regression import GeneralizedLinearRegression\n",
    "from pyspark.ml import Pipeline\n",
    "from synapse.ml.train import TrainRegressor\n",
    "\n",
    "glr = GeneralizedLinearRegression(family=\"poisson\", link=\"log\")\n",
    "poissonModel = TrainRegressor().setModel(glr).setLabelCol(\"price\").setNumFeatures(256)\n",
    "poissonPipe = Pipeline(stages=[cleanModel, poissonModel]).fit(train)\n",
    "poissonPrediction = poissonPipe.transform(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, repeat these steps to create a Random Forest Regression model using the\n",
    "`RandomRorestRegressor` API from Spark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train Random Forest regression on the same training data:\n",
    "from pyspark.ml.regression import RandomForestRegressor\n",
    "\n",
    "rfr = RandomForestRegressor(maxDepth=30, maxBins=128, numTrees=8, minInstancesPerNode=1)\n",
    "randomForestModel = TrainRegressor(model=rfr, labelCol=\"price\", numFeatures=256).fit(\n",
    "    train\n",
    ")\n",
    "randomForestPipe = Pipeline(stages=[cleanModel, randomForestModel]).fit(train)\n",
    "randomForestPrediction = randomForestPipe.transform(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the models have been trained and scored, compute some basic statistics\n",
    "to evaluate the predictions.  The following statistics are calculated for\n",
    "regression models to evaluate:\n",
    "* Mean squared error\n",
    "* Root mean squared error\n",
    "* R^2\n",
    "* Mean absolute error\n",
    "\n",
    "Use the `ComputeModelStatistics` API to compute basic statistics for\n",
    "the Poisson and the Random Forest models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.train import ComputeModelStatistics\n",
    "\n",
    "poissonMetrics = ComputeModelStatistics().transform(poissonPrediction)\n",
    "print(\"Poisson Metrics\")\n",
    "poissonMetrics.toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "randomForestMetrics = ComputeModelStatistics().transform(randomForestPrediction)\n",
    "print(\"Random Forest Metrics\")\n",
    "randomForestMetrics.toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also compute per instance statistics for `poissonPrediction`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.train import ComputePerInstanceStatistics\n",
    "\n",
    "\n",
    "def demonstrateEvalPerInstance(pred):\n",
    "    return (\n",
    "        ComputePerInstanceStatistics()\n",
    "        .transform(pred)\n",
    "        .select(\"price\", \"prediction\", \"L1_loss\", \"L2_loss\")\n",
    "        .limit(10)\n",
    "        .toPandas()\n",
    "    )\n",
    "\n",
    "\n",
    "demonstrateEvalPerInstance(poissonPrediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and with `randomForestPrediction`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "demonstrateEvalPerInstance(randomForestPrediction)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
